Skip to content

[Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image #6912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

hehe7318
Copy link
Contributor

@hehe7318 hehe7318 commented Jul 9, 2025

Description of changes

  • Introduce ensure_cleanup_role() with revision tagging and idempotent creation / update logic.
    • 4-step safe update sequence documented, now only after the inline policy succeeds, set or bump the revision tag.
    • Lambda VPCAccess managed policy is attached only when LambdaFunctionsVpcConfig exists in the config
  • Modify image_operations_controller to invoke ensure_cleanup_role when Build/Iam/CleanupLambdaRole is not provided and to fail fast on permission errors.
  • Refactor imagebuilder_stack to remove all per-stack cleanup-role logic and wire Lambda to the global role by default.
  • Update constants (role prefix / expected revision tag key & value).
  • IamClient – add create_role, attach_role_policy, put_role_policy, tag_role

Tests

  • Deleted legacy test_imagebuilder_lambda_execution_role.
  • Fixed the failing tests
  • Added new unit tests to cover the new logic
  • Update existing integration test to test the build-image stack can be fully self deleted
  • Manually tested build-image command in following scenario and all passed with expected result:
    1. no custom cleanup role specified, and no cleanup role pre-exist
    2. no custom cleanup role specified, and cleanup role already exists with expected revision tag
    3. no custom cleanup role specified, and cleanup role already exists, but with outdated revision tag
    4. custom cleanup role specified
    5. no cleanup role pre-exist, and run build-image command with no enough permission role
    6. cleanup role already exists with expected revision tag, and run build-image command with no enough permission role
    7. cleanup role already exists but with outdated revision tag, and run build-image command with no enough permission role
  • Newly added and old unit tests all passed.

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Introduce ensure_cleanup_role() with revision tagging and idempotent
  creation / update logic.
* Modify image_operations_controller to invoke ensure_cleanup_role when
  Build/Iam/CleanupLambdaRole is not provided and to fail fast on
  permission errors.
* Refactor imagebuilder_stack to remove all per-stack cleanup-role logic
  and wire Lambda to the global role by default.
* Update constants (role prefix / expected revision tag key & value).
@hehe7318 hehe7318 requested review from a team as code owners July 9, 2025 21:10
@hehe7318 hehe7318 added the 3.x label Jul 9, 2025
@hehe7318 hehe7318 closed this Jul 9, 2025
@hehe7318 hehe7318 reopened this Jul 9, 2025
@hehe7318 hehe7318 marked this pull request as draft July 9, 2025 21:13
hehe7318 added 2 commits July 15, 2025 14:54
IAM / cleanup role
------------------
* IamClient – add create_role, attach_role_policy, put_role_policy, tag_role
* ensure_cleanup_role
  * switch from raw boto3 to AWSApi.instance().iam
  * 4-step safe update sequence documented, now only after the inline policy succeeds, set or bump the revision tag.
  * inline-policy builder now receives partition
* Lambda VPCAccess managed policy is attached only when LambdaFunctionsVpcConfig exists in the config

Tests & utils
-------------
* deleted legacy test_imagebuilder_lambda_execution_role.
* fixed the failing tests
* added new unit tests to cover the new logic
* update existing integration test to test the build-image stack can be fully self deleted
@hehe7318 hehe7318 marked this pull request as ready for review July 15, 2025 18:59
@hehe7318 hehe7318 changed the title [Develop][Drafting] Global Cleanup IAM Role for ParallelCluster Build-Image [Develop] Global Cleanup IAM Role for ParallelCluster Build-Image Jul 15, 2025
@hehe7318 hehe7318 changed the title [Develop] Global Cleanup IAM Role for ParallelCluster Build-Image [Develop] Introduce Global Cleanup IAM Role for ParallelCluster Build-Image Jul 15, 2025
@hehe7318 hehe7318 added the skip-bad-url-suffix-check Skip the checks regarding the bad URL suffix label Jul 15, 2025
CHANGELOG.md Outdated
@@ -13,6 +13,7 @@ CHANGELOG

**BUG FIXES**
- Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
- Fix a bug that caused build-image CloudFormation stacks fail to delete after images are successfully built.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. after images are successfully built. it fails despite the outcome of the build image. even if the build-image fails, the stack is not able to get auto-deleted
  2. fail to delete I'd say fail to auto-delete. Otherwise it may seem that the user cannot even delete the stack manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved it to CHANGES.
- Introduce default global build-image cleanup IAM versioning roles to prevent build-image CloudFormation stacks from failing to be automatically deleted after images are either successfully built or fail to build.

Comment on lines +115 to +116
# If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy
has_lambda_functions_vpc_config = cfg_dict.get("DeploymentSettings", {}).get("LambdaFunctionsVpcConfig")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the lambda function vpc configuraitons?

Copy link
Contributor Author

@hehe7318 hehe7318 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See

if self.config.lambda_functions_vpc_config:
managed_lambda_policy.append(Fn.sub(LAMBDA_VPC_ACCESS_MANAGED_POLICY))

The cleanup role needs the AWSLambdaVPCAccessExecutionRole when LambdaFunctionsVpcConfig exists in the config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thank you.
Nitpick: we need this line only if the user is using the default role, so you can move the line into the if branch below.

raw_cfg_str = build_image_request_content["imageConfiguration"]
cfg_dict = yaml.safe_load(raw_cfg_str) or {}
# If CleanupLambdaRole exists in the config, skip ensure_cleanup_role
has_custom_cleanup_role = cfg_dict.get("Build", {}).get("Iam", {}).get("CleanupLambdaRole")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CodeStyle] Encapsulate the logic to check whether a custom role is set into a dedicated function and unit test it independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. It's a common dig logic and the logic to test if it can correctly dig is already include in the unit test test_enable_cleanup_role_call_and_vpc_flag.

if not has_custom_cleanup_role:
try:
account_id = AWSApi.instance().sts.get_account_id()
ensure_cleanup_role(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CodeStyle] ensure_cleanup_role is a bit vague name for this function.
The name should reflect that iwe are talking about the default cleanup role and what cleanup role we are talking about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, modified to ensure_default_build_image_stack_cleanup_role

except AWSClientError as e:
if e.error_code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"):
raise BadRequestException(
"Current principal lacks permissions to create or update the ParallelCluster build-image "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Design] Why do we need to update the role>? Didn't we agreed on having versioned roles specifically to not update the policy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reached an agreement yesterday. Done. Changed the logic to create roles with revision in their name.

"Action": "sts:AssumeRole",
"Condition": {
"ArnLike": {
"aws:SourceArn": f"arn:{partition}:lambda:*:{account_id}:function:ParallelClusterImage-*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Security] This condition allows every function in the stack to assume this role, but we want only a specific function to be able to. Can we use a more specific name?

Copy link
Contributor Author

@hehe7318 hehe7318 Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't, the delete stack lambda function is named like "ParallelClusterImage-15a09820-6274-11f0-8162-0e903a4ee48d".

try:
resp = iam.get_role(role_name=role_name)
tags = {t["Key"]: t["Value"] for t in resp["Role"].get("Tags", [])}
current_revision = int(tags.get(CLEANUP_ROLE_REVISION_TAG_KEY, 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using the revision number specified in PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my approach, create role with tag=0. And the expected revision is 1. So that it can enter the if else block if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION to update the policy.
See my function description, in this approach, it can prevent tag updated but policy not match the expectation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer applicable. But the logic didn't change.

current_revision = int(tags.get(CLEANUP_ROLE_REVISION_TAG_KEY, 0))
except AWSClientError as e:
if e.error_code == "NoSuchEntity":
logging.info("Creating global cleanup role %s", role_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify that we are creating it because it does not exists. May sound obvious but since this is a critical resource better to specify why we take some decisions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

current_revision = 0
else:
raise e
# (re)attach AWSLambdaVPCAccessExecutionRole
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why re-attach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced yesterday. It's idempotent

iam.attach_role_policy(role_name, cleanup_role_basic_managed_policy)

# Put (or overwrite) inline policy
iam.put_role_policy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Design] Do we have a way to avoid policy updates? I thought that by using versioned roles we could do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have, and it has been applied. See if current_revision < PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_EXPECTED_REVISION: and the function description for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to have revision in the role name and do not update the role.

@hehe7318 hehe7318 removed the skip-bad-url-suffix-check Skip the checks regarding the bad URL suffix label Jul 16, 2025
@@ -10,6 +10,7 @@ CHANGELOG
- Support DCV on Amazon Linux 2023.
- Upgrade Python runtime used by Lambda functions to python3.12 (from python3.9).
- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
- Introduce default global build-image cleanup IAM versioning roles to prevent build-image CloudFormation stacks from failing to be automatically deleted after images are either successfully built or fail to build.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should capture the following details in the changelog entry:

  1. the role is long-living (not bound to the stack lifecycle) and it is meant to persist the stack deletion.
  2. the role name: since the role is meant to persist it's important to communicate its name so that users do not think it is a left over when they see it.
  3. Since this is a long lasting issue in pcluster, it may be good to link the issue we are solving.

The fact that the role is versioned is an implementation details, more than something related to the user experience, so we can skip that detail.

The build-image command now deploys a global role that is used to automatically delete the build-image stack after images either succeed or fail the build. The role is meant to exists even after the stack has been deleted. This is to prevent build-image stack deletion failures, reported in https://github.com/aws/aws-parallelcluster/issues/5914

Not: since the changelog entry is long, use multilines to write it.

Comment on lines +115 to +116
# If LambdaFunctionsVpcConfig exists in the config, attach the AWS-managed LambdaVPCAccess policy
has_lambda_functions_vpc_config = cfg_dict.get("DeploymentSettings", {}).get("LambdaFunctionsVpcConfig")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thank you.
Nitpick: we need this line only if the user is using the default role, so you can move the line into the if branch below.


if not has_custom_cleanup_role:
try:
account_id = AWSApi.instance().sts.get_account_id()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this sts call fails, we would return an error message saying that the principal is missing the permissions to create/update the role. Make sure to add sts permissions to our documentation

"""Return the role name including a revision number."""
hashed_account_id = generate_string_hash(account_id)
return (
f"{PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_PREFIX}-{hashed_account_id}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This role is supposed to not be deleted or modified by the user.
What about adding a DO-NOT-DELETE suffix in the role name?
This is a standard approach, that we also adopted for the S3 bucket, which has a similar lifecycle.

What about PClusterBuildImageCleanupRole-v{revision}-{hash}-do-not-delete?
If it is too long, we can use PClusterImageCleanupRole-v{revision}-{hash}-do-not-delete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants